AITopics

2605.2201

Genre: Research Report (0.90)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Pronzato, Luc, Zhigljavsky, Anatoly

Non-asymptotic quantisation of spherically symmetric distributions

arXiv.org Machine Learning May-14-2026

Zador's celebrated theorem is a cornerstone of optimal quantisation, establishing both the weak limit of the empirical distribution of an $n$-point optimal quantiser in $R^d$ and the decay rate of the associated $L_s$-mean quantisation error. However, for large dimensions $d$, observing this asymptotic behaviour demands an astronomically large sample size $n$, which grows super-exponentially with $d$. Through a detailed analysis of the quantisation problem for spherically symmetric distributions, we demonstrate that for moderate $n$ random quantisers uniformly distributed on a sphere of suitable radius $r$ achieve exceptional performance. The expected distortion, expressed as a triple integral, can be computed with arbitrary precision, and the optimal radius $r$ can be efficiently determined numerically. Leveraging results from extreme-value theory, we derive approximations for $r$, particularly in scenarios where $n$ scales with $d$. Depending on the growth rate of $n$, $r$ may either converge to zero or approach a limiting value that is independent of $s$.
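As a rough illustration of the random-quantiser idea in this abstract, the sketch below estimates the $L_s$ distortion by plain Monte Carlo. It is not the paper's method: the abstract describes an exact triple-integral expression, whereas this code conditions on a single random draw of the $n$ points. The standard Gaussian target (one example of a spherically symmetric distribution) and the names `uniform_on_sphere` and `mc_distortion` are assumptions made for illustration.

```python
import math
import random

def uniform_on_sphere(d, r, rng):
    """Draw one point uniformly on the sphere of radius r in R^d
    by normalising a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [r * x / norm for x in g]

def mc_distortion(d, n, r, s=2.0, trials=500, seed=0):
    """Monte Carlo estimate of E[min_i ||X - q_i||^s] for X ~ N(0, I_d)
    (an assumed target) and ONE random draw of n points q_i on the sphere."""
    rng = random.Random(seed)
    points = [uniform_on_sphere(d, r, rng) for _ in range(n)]
    total = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Squared distance from x to its nearest quantiser point.
        best_sq = min(sum((a - b) ** 2 for a, b in zip(x, q)) for q in points)
        total += best_sq ** (s / 2.0)
    return total / trials

if __name__ == "__main__":
    # Crude grid search over the radius for d = 5, n = 50.
    for r in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
        print(r, round(mc_distortion(5, 50, r), 3))
```

With `r = 0` all points collapse to the origin, so the estimate for `s = 2` should be close to $d$. Scanning `r` over a grid gives a crude numerical stand-in for the radius optimisation the abstract mentions.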

artificial intelligence, quantiser, sd 1, (16 more...)

2605.12568

Genre: Research Report (0.65)

Technology: Information Technology > Artificial Intelligence (0.46)

Awano, Ryoya, Suzuki, Taiji

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

arXiv.org Machine Learning May-14-2026

Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces $V_k$, and is fine-tuned under the supervision of a weak model specialized on task $κ$. We prove that the strong model efficiently learns task $κ$, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the sense that the strong model acquires the target feature direction through W2S training, rather than having it given a priori. Moreover, W2S preserves pre-trained off-target features, whereas standard supervised fine-tuning causes catastrophic forgetting when off-target feature directions are correlated with the target's. Numerical experiments on synthetic data confirm our theoretical results.

high probability, machine learning, natural language, (17 more...)

2605.12908

Country: Asia (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

arXiv.org Machine Learning May-13-2026

Explicit integral representations and quantitative bounds for two-layer ReLU networks

Lee, Anthony

An approach to constructing explicit integral representations for two-layer ReLU networks is presented, which provides relatively simple representations for any multivariate polynomial. Quantitative bounds are provided for a particular, sharpened ReLU integral representation, which involves a harmonic extension and a projection. The bounds demonstrate that functions can be approximated with $L^{2}(\mathcal{D})$ errors that do not depend explicitly on dimension or degree, but rather on the coefficients of their monomial expansions and the distribution $\mathcal{D}$. We also present a connection to the RKHS of the exponential kernel $K(x,y)=\exp\left(\left\langle x,y\right\rangle \right)$, and a very simple integral representation that additionally involves multiplication by a fixed function and has better quantitative bounds.

artificial intelligence, machine learning, representation, (18 more...)

2604.2326

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Agazzi, Andrea, Bruno, Giuseppe, García, Eloy Mosig, Saviozzi, Samuele, Romito, Marco

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

arXiv.org Machine Learning Apr-30-2026

The transformer architecture [52], which underlies present-day Large Language Models, has been one of the main drivers of recent advances in machine learning and artificial intelligence. At each layer, the hidden state of the network is updated by sequentially applying two distinct operations: attention modules [3], which capture long-range interactions in the input sequence, and classical MultiLayer Perceptrons (MLPs), acting separately on each element of that sequence. Despite their empirical success, the mechanisms governing information propagation through depth, and the way attention and MLP blocks jointly shape internal representations, remain only partially understood from a theoretical viewpoint. Recent progress has come from viewing transformers in suitable scaling limits as deterministic mean-field interacting particle systems modeling the evolution of N tokens through the layers of the neural network architecture (the so-called residual stream dynamics), see, among others, [46, 26, 27, 45]. In these descriptions, depth plays the role of a continuous time variable, and, in the large-context regime (N → ∞), the evolution of token representations is encoded by a PDE for their empirical distribution. This viewpoint is closely connected to the literature on scaling laws, where the effect of various scaling exponents controlling the relative size of the network's hyperparameters (e.g., depth, width, context length) on the effective dynamics of the model

lemma 2, machine learning, natural language, (19 more...)

2604.26898

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.54)

Neural Information Processing Systems Apr-27-2026, 21:48:07 GMT

Appendix

Our results rely heavily on the specific nature of the periodic activation function, so a natural question is to what extent they can be extended beyond the single periodic neuron class. For lower bounds, a challenging but very interesting generalization would be to establish the cryptographic hardness of learning certain families of GLMs whose activation function need not be periodic. A potentially easier route in this direction would be to consider the Hermite decomposition of the activation function, similar to [A3], and establish lower bounds on the performance of low-degree methods [A23], of SGD [A3], or of local search methods [A15], for activation functions whose low-degree Hermite coefficients are exponentially small. For upper bounds, we believe that our proposed LLL-based algorithm may be extended beyond learning even periodic activation functions, such as the cosine activation, by appropriately post-processing the measurements, but leave this for future work. Furthermore, it would be interesting to better understand (empirically or analytically) the noise tolerance of our LLL-based algorithm for "low-frequency" activation functions, such as the absolute value underlying the phase retrieval problem, which has "zero" frequency.

artificial intelligence, exp, machine learning, (16 more...)

Country: North America > United States (0.46)

Genre: Research Report (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Neural Information Processing Systems Apr-26-2026, 17:39:48 GMT

Bandit Phase Retrieval

Tor Lattimore

We study a bandit version of phase retrieval where the learner chooses actions $(A_t)_{t=1}^n$ in the $d$-dimensional unit ball and the expected reward is $\langle A_t, \theta_\star \rangle^2$ with $\theta_\star \in \mathbb{R}^d$ an unknown parameter vector. We prove an upper bound on the minimax cumulative regret in this problem of $\tilde{O}(d\sqrt{n})$, which matches known lower bounds up to logarithmic factors and improves on the best known upper bound by a factor of $\sqrt{d}$. We also show that the minimax simple regret is $\tilde{\Theta}(d/\sqrt{n})$ and that this is only achievable by an adaptive algorithm. Our analysis shows that an apparently convincing heuristic for guessing lower bounds can be misleading and that uniform bounds on the information ratio for information-directed sampling [Russo and Van Roy, 2014] are not sufficient for optimal regret.

algorithm, artificial intelligence, machine learning, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems Apr-25-2026, 20:19:12 GMT

Deep Neural Networks as Point Estimates for Deep Gaussian Processes

Vincent Dutordoir

This section gives a brief overview of some of the useful properties of spherical harmonics. We refer the interested reader to Dai and Xu [55] and Efthimiou and Frye [56] for an in-depth overview. Spherical harmonics are special functions defined on a hypersphere and originate from solving Laplace's equation. They form a complete set of orthogonal functions, and any sufficiently regular function defined on the sphere can be written as a sum of these spherical harmonics, similar to the Fourier series with sines and cosines. Spherical harmonics have a natural ordering by increasing angular frequency.
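The orthogonality property mentioned above can be checked numerically in a simple special case. On the ordinary sphere $S^2$, the zonal (axially symmetric) spherical harmonics are proportional to Legendre polynomials $P_n(t)$ in $t = \cos\theta$, which satisfy $\int_{-1}^{1} P_m(t) P_n(t)\,dt = \frac{2}{2n+1}\delta_{mn}$. The sketch below is only an illustration of that fact, with hypothetical helper names `legendre` and `legendre_inner`; it is not taken from the paper.

```python
def legendre(n, t):
    """Evaluate the Legendre polynomial P_n(t) using Bonnet's recursion:
    (k + 1) P_{k+1}(t) = (2k + 1) t P_k(t) - k P_{k-1}(t)."""
    if n == 0:
        return 1.0
    p_prev, p = 1.0, t
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * t * p - k * p_prev) / (k + 1)
    return p

def legendre_inner(m, n, grid=2000):
    """Midpoint-rule approximation of the integral of P_m * P_n over [-1, 1]."""
    h = 2.0 / grid
    total = 0.0
    for i in range(grid):
        t = -1.0 + (i + 0.5) * h
        total += legendre(m, t) * legendre(n, t)
    return total * h

if __name__ == "__main__":
    print(legendre_inner(2, 3))  # close to 0 (different degrees are orthogonal)
    print(legendre_inner(3, 3))  # close to 2 / 7
```

Increasing degree `n` corresponds to increasing angular frequency, which is the natural ordering the overview refers to.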

artificial intelligence, machine learning, polynomial, (17 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)

Neural Information Processing Systems Apr-24-2026, 21:09:37 GMT

1680e9fa7b4dd5d62ece800239bb53bd-Supplemental.pdf

We briefly review some basic notions of the geometry of the sphere that we use in our algorithm and convergence analysis. We refer the reader to [1, p. 73-76] for a more comprehensive presentation. Tangent Space: The tangent space of the $r$-dimensional sphere $S^r$ at a point $p$ is an $r$-dimensional vector space, which generalizes the notion of the tangent plane in two dimensions. We denote it by $T_p S^r$, and a vector $v$ belongs to it if and only if it can be written as $\dot{\alpha}(0)$, where $\alpha : (-\varepsilon, \varepsilon) \to S^r$ (for some $\varepsilon > 0$) is a smooth curve with $\alpha(0) = p$. The tangent space at $p$ can also be given explicitly, as the set of all vectors in $\mathbb{R}^{r+1}$ orthogonal to $p$ with respect to the usual inner product.
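The explicit description of the tangent space as the vectors orthogonal to the base point suggests a direct way to project an ambient vector onto it: subtract the component along the base point. A minimal sketch, assuming the base point `p` has unit norm; the helper names `dot` and `project_to_tangent` are hypothetical and not from the supplement.

```python
def dot(u, v):
    """Usual Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

def project_to_tangent(p, v):
    """Orthogonal projection of v onto the tangent space at p,
    i.e. onto {w : <w, p> = 0}. Assumes ||p|| = 1."""
    c = dot(p, v)
    return [vi - c * pi for vi, pi in zip(v, p)]

if __name__ == "__main__":
    p = [0.0, 0.0, 1.0]           # north pole of the unit sphere in R^3
    v = [1.0, 2.0, 3.0]           # arbitrary ambient vector
    w = project_to_tangent(p, v)
    print(w, dot(w, p))           # w is orthogonal to p
```

Projecting a Euclidean gradient in this way is the standard route to a Riemannian gradient on the sphere.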

artificial intelligence, gradf, machine learning, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.36)

Neural Information Processing Systems Apr-24-2026, 21:09:34 GMT

Distributed Principal Component Analysis with Limited Communication

We study efficient distributed algorithms for the fundamental problem of principal component analysis and leading eigenvector computation on the sphere, when the data are randomly distributed among a set of computational nodes. We propose a new quantized variant of Riemannian gradient descent to solve this problem, and prove that the algorithm converges with high probability under a set of necessary spherical-convexity properties. We give bounds on the number of bits transmitted by the algorithm under common initialization schemes, and investigate the dependency on the problem dimension in each case.
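To make the idea of Riemannian gradient descent with quantized gradients concrete, here is a single-machine sketch for the leading eigenvector of a small symmetric matrix. It is an illustration under strong assumptions, not the paper's algorithm: the quantizer below simply rounds each coordinate to a fixed grid (the abstract does not specify the scheme), there is no distribution across nodes, and all names (`quantize`, `quantized_rgd`, `grid_step`) are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def normalize(x):
    n = math.sqrt(dot(x, x))
    return [c / n for c in x]

def quantize(v, grid_step):
    """ASSUMED quantizer: round each coordinate to a grid of spacing grid_step."""
    return [grid_step * round(c / grid_step) for c in v]

def quantized_rgd(A, x0, lr=0.1, steps=500, grid_step=1e-3):
    """Minimise f(x) = -x^T A x over the unit sphere with quantized gradients."""
    x = normalize(x0)
    for _ in range(steps):
        Ax = matvec(A, x)
        c = dot(x, Ax)
        # Riemannian gradient: Euclidean gradient -2Ax projected onto the
        # tangent space at x (vectors orthogonal to x).
        grad = [-2.0 * (a - c * xi) for a, xi in zip(Ax, x)]
        g = quantize(grad, grid_step)
        # Gradient step followed by renormalisation (a simple retraction).
        x = normalize([xi - lr * gi for xi, gi in zip(x, g)])
    return x

if __name__ == "__main__":
    A = [[2.0, 0.0], [0.0, 1.0]]
    x = quantized_rgd(A, [1.0, 1.0])
    print(x, dot(x, matvec(A, x)))  # Rayleigh quotient should be near 2
```

In this toy version the coarseness of `grid_step` limits the final accuracy, since gradients smaller than half a grid cell are rounded to zero.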

algorithm, artificial intelligence, machine learning, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Principal Component Analysis (0.61)